Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure socket path is < 104 characters when STATE_PATH is set. #4909

Merged
merged 11 commits into from
Jun 21, 2024

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Jun 11, 2024

What does this PR do?

It ensures the statePath set by the container command will always generate an Unix socket path that is smaller than 108 characters. This ensures the Unix socket can be created and the Elastic-Agent always successfully starts.

Why is it important?

Unix socket paths need to be less than 104 characters long, if they're not Elastic-Agent will fail to create a socket and will not start. When it fails it logs those messages:

{"log.level":"error","@timestamp":"2024-06-10T21:34:44.818Z","log.logger":"control","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/control/v2/server.(*Server).Start","file.name":"server/server.go","file.line":85},"message":"unable to create listener: listen unix /tmp/TestContainerCMDEventToStderr1447633574/001/elastic-agent-8.15.0-SNAPSHOT-linux-x86_64/data/elastic-agent.sock: bind: invalid argument","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-06-10T21:34:44.818Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.logReturn","file.name":"cmd/run.go","file.line":147},"message":"listen unix /tmp/TestContainerCMDEventToStderr1447633574/001/elastic-agent-8.15.0-SNAPSHOT-linux-x86_64/data/elastic-agent.sock: bind: invalid argument","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

The Elastic-Agent container command will not always respect the STATE_PATH environment variable. In cases where it would lead to a Unix socket path too long, which prevents the Elastic-Agent from starting, the default path /usr/share/elastic-agent/state is used.

How to test this PR locally

  1. Get a Stack + Fleet deployed
  2. Set the following environment variables according to your deployment:
    • FLEET_ENROLL
    • FLEET_URL
    • FLEET_ENROLLMENT_TOKEN
  3. Set STATE_PATH to be longer than 100 characters
  4. Run the Elastic-Agent container command
  5. Assert:
    i. The Elastic-Agent runs and can communicate with Fleet
    ii. The state path is the default path :/usr/share/elastic-agent/state

Another option is to run the integration test TestContainerCMDWithAVeryLongStatePath from testing/integration/container_cmd_test.go

## Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

Copy link
Contributor

mergify bot commented Jun 11, 2024

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@belimawr belimawr force-pushed the fix-container-socket-path branch from 3859da7 to 8679507 Compare June 11, 2024 17:33
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jun 11, 2024
@belimawr belimawr marked this pull request as ready for review June 12, 2024 21:07
@belimawr belimawr requested a review from a team as a code owner June 12, 2024 21:07
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@belimawr
Copy link
Contributor Author

Quality Gate failed Quality Gate failed

Failed conditions 0.0% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

The changes are covered by integration tests.

@belimawr belimawr changed the title Ensure socket path when STATE_PATH is set is < 104 characters long Ensure socket path is <108 characters when STATE_PATH is set. Jun 13, 2024
@belimawr belimawr changed the title Ensure socket path is <108 characters when STATE_PATH is set. Ensure socket path is < 108 characters when STATE_PATH is set. Jun 13, 2024
belimawr added 5 commits June 13, 2024 10:43
This commit fixes the implementation on Windows and puts the
container integration tests in the container group again. The test for
long state paths will create files outside of the test directory.

It also cleans up the default folder used as state path.
@belimawr belimawr force-pushed the fix-container-socket-path branch from 6dcc963 to e46948b Compare June 17, 2024 21:36
@belimawr belimawr requested a review from cmacknz June 17, 2024 21:40
Copy link
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just have a couple of nitpicks

Also if the STATE_PATH is too long and the default is used, would that be clear in diagnostics?

internal/pkg/agent/cmd/container.go Outdated Show resolved Hide resolved
testing/integration/container_cmd_test.go Outdated Show resolved Hide resolved
@belimawr belimawr changed the title Ensure socket path is < 108 characters when STATE_PATH is set. Ensure socket path is < 104 characters when STATE_PATH is set. Jun 18, 2024
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, no major blockers, Just a nitpick on a direct cast for the agent cmd in test.

While I understand that this may not work for everyone, I feel that the state path should be mounted on a well known location as a local persistent volume or some other volume stored on the host that should have the same lifetime as the deployment: having the STATE_PATH env var does not feel very container/cloud-native..

testing/integration/container_cmd_test.go Show resolved Hide resolved
@belimawr belimawr requested a review from michel-laterman June 18, 2024 15:37
expectedStatePath: "/tmp/tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt",
expectedSocketPath: "/tmp/elastic-agent/oKUUJbxrLlGSh3z6wZWYleLeMuUN4P0_.sock",
},
"long path": { // Path too long to create a unix socket, it will use /tmp/elastic-agent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't most of these test cases just tests of the SocketURLWithFallback() function? How long does this new set of tests take to run?

Do you get the same effective test coverage if you update the single existing container command to use a path longer than 107 characters as the exception path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't most of these test cases just tests of the SocketURLWithFallback() function? How long does this new set of tests take to run?

No, they're testing the https://github.com/elastic/elastic-agent/blob/a6d91b029ae8c0b8ccf7bb391c8ac304fe1e9368/internal/pkg/agent/cmd/container.go#L773-L837 function and how it decides to create the socket path. The fact that it calls SocketURLWithFallback is an implementation detail. It also needs to create some folders:

All of this gets tested by the integration test.

Maybe we can remove one test case: 107 characters path as it essentially is the same as small path.

@cmacknz
Copy link
Member

cmacknz commented Jun 18, 2024

Once this compiles, make sure the elastic-agent status command still works inside the container.

@belimawr
Copy link
Contributor Author

Once this compiles, make sure the elastic-agent status command still works inside the container.

It works:

% k exec -it elastic-agent-vspdt -- /bin/bash
�]0;root@kind-control-plane: /usr/share/elastic-agentroot@kind-control-plane:/usr/share/elastic-agent# elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   └─ status: (HEALTHY) Running
�]0;root@kind-control-plane: /usr/share/elastic-agentroot@kind-control-plane:/usr/share/elastic-agent# 

Copy link

Quality Gate failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@jlind23 jlind23 merged commit 8a30900 into elastic:main Jun 21, 2024
12 of 13 checks passed
@jlind23
Copy link
Contributor

jlind23 commented Jun 21, 2024

Everything works except he sonar unit code coverage which is superseded by the integration tests hence force merging this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants